6 research outputs found
PADIC: extension and new experiments
International audiencePADIC is a multidialectal parallel Arabic corpus. It was composed initially by five Arabic dialects, three from the Maghreb and two from the Middle East, in addition to standard Arabic. In this paper, we present an augmented version of PADIC with a Moroccan dialect. We give also an evaluation, using the σ–index, of the computerization level of the Arabic dialects present in PADIC which reveals that these languages are really under-resourced. Several experiments in machine translation, in both sides between all the combinations of language pairs, are discussed too. For each language, we interpolated the corresponding Language Model (LM) with a large Arabic corpus based LM. The results show that this interpolation is in some cases without effect on the performances of translation systems and in others is rather penalizing
Cross-Lingual Semantic Similarity Measure for Comparable Articles
International audienceWe aim in this research to find and compare crosslingual articles concerning a specific topic. So, we need measure for that. This measure can be based on bilingual dictionaries or based on numerical methods such as Latent Semantic Indexing (LSI). In this paper, we use the LSI in two ways to retrieve Arabic-English comparable articles. The first one is monolingual: the English article is translated into Arabic and then mapped into the Arabic LSI space; the second one is crosslingual: Arabic and English documents are mapped into Arabic-English LSI space. Then, we compare LSI approaches to the dictionary-based approach on several English-Arabic parallel and comparable corpora. Results indicate that the performance of cross-lingual LSI approach is competitive to monolingual approach, or even better for some corpora. Moreover, both LSI approaches outperform the dictionary approach
PADIC: extension and new experiments
International audiencePADIC is a multidialectal parallel Arabic corpus. It was composed initially by five Arabic dialects, three from the Maghreb and two from the Middle East, in addition to standard Arabic. In this paper, we present an augmented version of PADIC with a Moroccan dialect. We give also an evaluation, using the σ–index, of the computerization level of the Arabic dialects present in PADIC which reveals that these languages are really under-resourced. Several experiments in machine translation, in both sides between all the combinations of language pairs, are discussed too. For each language, we interpolated the corresponding Language Model (LM) with a large Arabic corpus based LM. The results show that this interpolation is in some cases without effect on the performances of translation systems and in others is rather penalizing
Constitution d'un corpus de la langue Arabe à partir du Web
International audienceLa toile est une source intarissable de données textuelles. Ces dernières années la communauté travaillant sur les différents aspects de la langue s'est tournée vers le web afin de bénéficier de cette masse impressionnante d'informations. Cet article décrit un outil de construction de corpus pour l'Arabe. Il permet de recueillir automatiquement une liste de sites dédiés à la langue Arabe. Ensuite le contenu de ces sites est extrait et est normalisé. Le corpus ainsi constitué peut être utilisé dans diverses applications de traitement du langage naturel et plus particulièrement dans le calcul de modèles de langage statistiques
Constitution d'un corpus de la langue Arabe à partir du Web
International audienceLa toile est une source intarissable de données textuelles. Ces dernières années la communauté travaillant sur les différents aspects de la langue s'est tournée vers le web afin de bénéficier de cette masse impressionnante d'informations. Cet article décrit un outil de construction de corpus pour l'Arabe. Il permet de recueillir automatiquement une liste de sites dédiés à la langue Arabe. Ensuite le contenu de ces sites est extrait et est normalisé. Le corpus ainsi constitué peut être utilisé dans diverses applications de traitement du langage naturel et plus particulièrement dans le calcul de modèles de langage statistiques
PADIC: extension and new experiments
International audiencePADIC is a multidialectal parallel Arabic corpus. It was composed initially by five Arabic dialects, three from the Maghreb and two from the Middle East, in addition to standard Arabic. In this paper, we present an augmented version of PADIC with a Moroccan dialect. We give also an evaluation, using the σ–index, of the computerization level of the Arabic dialects present in PADIC which reveals that these languages are really under-resourced. Several experiments in machine translation, in both sides between all the combinations of language pairs, are discussed too. For each language, we interpolated the corresponding Language Model (LM) with a large Arabic corpus based LM. The results show that this interpolation is in some cases without effect on the performances of translation systems and in others is rather penalizing